Tiffany Chan

Feature Selection, Model Selection, and Tuning

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [349]:
conc = pd.read_csv('concrete.csv')
In [350]:
conc.head(5)
Out[350]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29
In [351]:
conc.shape
Out[351]:
(1030, 9)

1. Univariate analysis –data types and description of the independent attributes which should include (name, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions/tails, missing values, outliers, duplicates(10 Marks)

This is a description of the data before any sort of data cleaning, manipulation, or transformation.

In [352]:
#Finding the data types (dtypes) of all variables in the new dataset.
conc.dtypes
Out[352]:
cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
strength        float64
dtype: object

Each variable is a continuous variable in the dataset. All are floats, and therefore have decimal values except age. Age has no fractional values or decimals but is nonetheless a continuous variable.

In [353]:
# Descriptive statistics of all variables in the dataset.
# Components shown in the following table: Name, range of values observed( min-max), mean, median (50% quartile), standard deviation, and quartiles.

conc.describe()
Out[353]:
cement slag ash water superplastic coarseagg fineagg age strength
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.167864 73.895825 54.188350 181.567282 6.204660 972.918932 773.580485 45.662136 35.817961
std 104.506364 86.279342 63.997004 21.354219 5.973841 77.753954 80.175980 63.169912 16.705742
min 102.000000 0.000000 0.000000 121.800000 0.000000 801.000000 594.000000 1.000000 2.330000
25% 192.375000 0.000000 0.000000 164.900000 0.000000 932.000000 730.950000 7.000000 23.710000
50% 272.900000 22.000000 0.000000 185.000000 6.400000 968.000000 779.500000 28.000000 34.445000
75% 350.000000 142.950000 118.300000 192.000000 10.200000 1029.400000 824.000000 56.000000 46.135000
max 540.000000 359.400000 200.100000 247.000000 32.200000 1145.000000 992.600000 365.000000 82.600000

Understanding the meaning of the independent variables by looking at the codebook and pairing those definitions with descriptive statistics makes sense. Also, researching the topic, can be very useful.

For this assignment, our goal is to understand which variables can predict the strength of concrete. To make concrete, there are 4 main materials: cement, water, coarse aggregate (stones, rocks, etc.), and fine aggregate (sand, finer particulate). For this reason, the means of these columns are a lot higher than the rest.

Age is defined in days.

The other variables like slag (blast furnace slag), ash (or fly ash), and superplastic (or Superplasticizer) can be considered as secondary or additional materials used to strengthen concrete. Slag improves cement workability, improves long-term compressive and flexural performance, reduces permeability, as well as reduce concrete environmental damage. Ash, compared to concrete, absorbs moisture more easily and therefore can make concrete more stable and stronger. Suplerplasticizer is a cement water reducer, but too much superplasticizer can also destroy concrete's chemical composition, and therefore it would make sense that its amount is a lot lower compared to the other materials.

In [365]:
#Histograms of all the continuous variables to see the spread and shape of the data (Analysis of body of distributions/tails)..
columns = list(conc)[:] # Creating a new list with all columns 
conc[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); # Histogram of all columns

Although some variables like ash, slag and superplastic have excessive 0s, they should not be inputed because they are only secondary materials that may not be necessary to the making of concrete. 0 in this case are true values and hold meaning.

Let's look at these plots a little closer.

In [355]:
sns.distplot(conc['cement'])
Out[355]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d18eee0>
In [356]:
sns.distplot(conc['slag'])
Out[356]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d25c220>
In [357]:
sns.distplot(conc['ash'])
Out[357]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d480940>
In [358]:
sns.distplot(conc['water'])
Out[358]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d4deb50>
In [359]:
sns.distplot(conc['superplastic'])
Out[359]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d5613a0>
In [360]:
sns.distplot(conc['coarseagg'])
Out[360]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d5fc8e0>
In [361]:
sns.distplot(conc['fineagg'])
Out[361]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d6798b0>
In [362]:
sns.distplot(conc['age'])
Out[362]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d708c40>
In [366]:
sns.distplot(conc['strength'])
Out[366]:
<matplotlib.axes._subplots.AxesSubplot at 0x17615bf2b50>

None of the variables are normally distributed. Strength is somewhat similar to a normally distributed bell-curve. Other variables seem to have skewed tails on the left and right. This could potentially mean there are outliers in this dataset that we may need to handle.

In [364]:
#Looking for missing values in each variable

conc.isnull().sum()
Out[364]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

There are no missing values in the dataset.

We will now observe if there are any outliers present.

In [368]:
#Looking for outliers that lie beyond the whiskers of boxplots for every variable.

sns.boxplot(conc['cement'])
Out[368]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761cc9b8e0>
In [369]:
sns.boxplot(conc['slag'])
Out[369]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761ccd09a0>
In [370]:
sns.boxplot(conc['ash'])
Out[370]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761cd2c550>
In [371]:
sns.boxplot(conc['water'])
Out[371]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761cd7e250>
In [372]:
sns.boxplot(conc['superplastic'])
Out[372]:
<matplotlib.axes._subplots.AxesSubplot at 0x176155b1550>
In [373]:
sns.boxplot(conc['coarseagg'])
Out[373]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d174a30>
In [374]:
sns.boxplot(conc['fineagg'])
Out[374]:
<matplotlib.axes._subplots.AxesSubplot at 0x17615d08af0>
In [375]:
sns.boxplot(conc['age'])
Out[375]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761ca913d0>
In [377]:
sns.boxplot(conc['strength'])
Out[377]:
<matplotlib.axes._subplots.AxesSubplot at 0x17615ddfcd0>

There are multiple outliers in many of the variables. We will handle the outliers later in this analysis.

In [379]:
#Duplicates

#test_list = list(set(test_list)) 

#Duplicate cases except the first instance.
concduplicaterows = conc[conc.duplicated(keep='first')]
print(concduplicaterows)
#Find the number of sets of duplicates using groupby function. Disregard the sum function.
print(concduplicaterows.groupby(by=["cement","slag","ash","water","superplastic","coarseagg","fineagg","age","strength"]).sum())
     cement   slag  ash  water  superplastic  coarseagg  fineagg  age  \
278   425.0  106.3  0.0  153.5          16.5      852.1    887.1    3   
298   425.0  106.3  0.0  153.5          16.5      852.1    887.1    3   
400   362.6  189.0  0.0  164.9          11.6      944.7    755.8    3   
420   362.6  189.0  0.0  164.9          11.6      944.7    755.8    3   
463   362.6  189.0  0.0  164.9          11.6      944.7    755.8   56   
468   252.0    0.0  0.0  185.0           0.0     1111.0    784.0   28   
482   425.0  106.3  0.0  153.5          16.5      852.1    887.1   91   
493   362.6  189.0  0.0  164.9          11.6      944.7    755.8   91   
517   425.0  106.3  0.0  153.5          16.5      852.1    887.1   56   
525   362.6  189.0  0.0  164.9          11.6      944.7    755.8   28   
527   425.0  106.3  0.0  153.5          16.5      852.1    887.1   91   
576   362.6  189.0  0.0  164.9          11.6      944.7    755.8    7   
577   425.0  106.3  0.0  153.5          16.5      852.1    887.1   28   
604   362.6  189.0  0.0  164.9          11.6      944.7    755.8   56   
733   362.6  189.0  0.0  164.9          11.6      944.7    755.8   91   
738   362.6  189.0  0.0  164.9          11.6      944.7    755.8   28   
766   362.6  189.0  0.0  164.9          11.6      944.7    755.8   91   
830   425.0  106.3  0.0  153.5          16.5      852.1    887.1    7   
880   425.0  106.3  0.0  153.5          16.5      852.1    887.1   56   
884   425.0  106.3  0.0  153.5          16.5      852.1    887.1    7   
892   362.6  189.0  0.0  164.9          11.6      944.7    755.8   56   
933   362.6  189.0  0.0  164.9          11.6      944.7    755.8    7   
943   362.6  189.0  0.0  164.9          11.6      944.7    755.8    3   
967   362.6  189.0  0.0  164.9          11.6      944.7    755.8   28   
992   425.0  106.3  0.0  153.5          16.5      852.1    887.1   28   

     strength  
278     33.40  
298     33.40  
400     35.30  
420     35.30  
463     77.30  
468     19.69  
482     65.20  
493     79.30  
517     64.30  
525     71.30  
527     65.20  
576     55.90  
577     60.29  
604     77.30  
733     79.30  
738     71.30  
766     79.30  
830     49.20  
880     64.30  
884     49.20  
892     77.30  
933     55.90  
943     35.30  
967     71.30  
992     60.29  
Empty DataFrame
Columns: []
Index: [(252.0, 0.0, 0.0, 185.0, 0.0, 1111.0, 784.0, 28, 19.69), (362.6, 189.0, 0.0, 164.9, 11.6, 944.7, 755.8, 3, 35.3), (362.6, 189.0, 0.0, 164.9, 11.6, 944.7, 755.8, 7, 55.9), (362.6, 189.0, 0.0, 164.9, 11.6, 944.7, 755.8, 28, 71.3), (362.6, 189.0, 0.0, 164.9, 11.6, 944.7, 755.8, 56, 77.3), (362.6, 189.0, 0.0, 164.9, 11.6, 944.7, 755.8, 91, 79.3), (425.0, 106.3, 0.0, 153.5, 16.5, 852.1, 887.1, 3, 33.4), (425.0, 106.3, 0.0, 153.5, 16.5, 852.1, 887.1, 7, 49.2), (425.0, 106.3, 0.0, 153.5, 16.5, 852.1, 887.1, 28, 60.29), (425.0, 106.3, 0.0, 153.5, 16.5, 852.1, 887.1, 56, 64.3), (425.0, 106.3, 0.0, 153.5, 16.5, 852.1, 887.1, 91, 65.2)]

There are duplicates in the dataset. The challenge here is to decide whether they should be removed or not. In making some of the models like linear regression, duplicates can bias the results. However, although these values appear twice or three times in the dataset, we are not really certain whether these are actual duplicates because the dataset contains no ID or Case numbers. These points could be real data that coincidentally have identical values. There could very well be 2 or 3 concrete samples having the same results. In total, there are 11 sets of duplicates in this dataset (25 extra cases). Since 25 cases out of 1030 is only 2.4%, changes made to these cases would not make such a big impact on the statistical findings of the entire dataset.

Before deleting duplicates, one should always consider:

  1. Be certain the duplicate cases is not real data that coincidentally have identical values.
  2. Figure out why there are duplicates in the data.

In this situation, we don't know the answer to these two concerns.

2. Bi-variate analysis between the predictor variables and between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms, or density curves. (10 marks)

In [35]:
sns.pairplot(conc);
In [33]:
sns.heatmap(conc.corr(), annot = True);

After evaluating Pearson's correlation matrix for the variables, there are oly 2 relationships that are moderately correlated.

  1. Water and superplastic are inversely correlated. Pearson's correlation: -0.66, suggesting that an increase of water amount is paired with a decrease in superplasticizer, and vice versa. However, this correlation is not that strong and should not be taken too seriously.
  2. Cement and strength (outcome variable) are also moderately correlated. Pearson's correlation: 0.5, meaning that as the cement amount goes up, the strength also goes up. Like the previous observation, this correlation relationship is only moderate and should not be taken too seriously.
  3. Every other relationship seems to be very weakly correlated.
In [40]:
#Visualizing bivariate analysis of 2 continuous variables (One is an independent variable, and the other is the dependent target variable)
#Strength is the target variable of out analysis.

#Cement x strength
conc_cem_strength = sns.jointplot(x = 'strength', y = 'cement', data = conc, kind = 'reg', color="#4CB391")
conc_cem_strength.set_axis_labels(xlabel='Strength', ylabel='Cement')
Out[40]:
<seaborn.axisgrid.JointGrid at 0x296f9f7deb0>
In [41]:
#Slag x Strength
conc_slag_strength = sns.jointplot(x = 'strength', y = 'slag', data = conc, kind = 'reg', color="#CE75A7")
conc_slag_strength.set_axis_labels(xlabel='Strength', ylabel='Slag')
Out[41]:
<seaborn.axisgrid.JointGrid at 0x296f87d45b0>
In [42]:
#Ash x Strength
conc_ash_strength = sns.jointplot(x = 'strength', y = 'ash', data = conc, kind = 'reg', color="#727ee4")
conc_ash_strength.set_axis_labels(xlabel='Strength', ylabel='Ash')
Out[42]:
<seaborn.axisgrid.JointGrid at 0x296f87fb190>
In [43]:
#Water x Strength
conc_water_strength = sns.jointplot(x = 'strength', y = 'water', data = conc, kind = 'reg', color="#f2013a")
conc_water_strength.set_axis_labels(xlabel='Strength', ylabel='Water')
Out[43]:
<seaborn.axisgrid.JointGrid at 0x296f6f70a00>
In [45]:
#Superplasticizer x Strength
conc_super_strength = sns.jointplot(x = 'strength', y = 'superplastic', data = conc, kind = 'reg', color="#8f1271")
conc_super_strength.set_axis_labels(xlabel='Strength', ylabel='Superplasticizer')
Out[45]:
<seaborn.axisgrid.JointGrid at 0x296f27000d0>
In [49]:
#Coarse Aggregate x Strength
conc_coarseagg_strength = sns.jointplot(x = 'strength', y = 'coarseagg', data = conc, kind = 'reg', color="#642815")
conc_coarseagg_strength.set_axis_labels(xlabel='Strength', ylabel='Coarse Aggregate')
Out[49]:
<seaborn.axisgrid.JointGrid at 0x296f8709700>
In [50]:
#Fine Aggregate x Strength
conc_fineagg_strength = sns.jointplot(x = 'strength', y = 'fineagg', data = conc, kind = 'reg', color="#22962c")
conc_fineagg_strength.set_axis_labels(xlabel='Strength', ylabel='Fine Aggregate')
Out[50]:
<seaborn.axisgrid.JointGrid at 0x296f66c2370>
In [380]:
#Age x Strength
conc_age_strength = sns.jointplot(x = 'strength', y = 'age', data = conc, kind = 'reg', color="#1d2ba5")
conc_age_strength.set_axis_labels(xlabel='Strength', ylabel='Age')
Out[380]:
<seaborn.axisgrid.JointGrid at 0x17615dfa340>

None of the variables seem to fit perfectly on a straight line. However, you never know. It would still be worth it to see how linear regression peforms on this dataset.

3. Feature Engineering techniques(10 marks)

-Identify opportunities (if any) to extract new features from existing features, drop a feature(if required) Hint: Feature Extraction, for example, consider a dataset with two features length and breadth. From this, we can extract a new feature Area which would be length * breadth.

-Get the data model ready and do a train test split.

-Decide on the complexity of the model, should it be a simple linear model in terms of parameters or would a quadratic or higher degree.

There are multiple opportunities to extract new features from existing features in this dataset. In certain cases,

  1. Age can be re-categorized so that we can make a feature that may better generalize the test data. After researching on concrete, age can be broken down into unsatisfactory, minimum, standard, extra number of aging days.

  2. When mixing aggregate (coarse aggregate or stone) with sand (fine aggregate), there is supposed to be a 2:1 ratio. We can create a new feature that captures the ratio between these two features.

  3. There is also a water to cement ratio that exists and it should be 0.45-0.6. We can create another feature that captures the quatitifying relationship between these two features.

  4. In terms of the superplasticizer, the percentage generally ranges from 0-7% of the entire mixture of all concrete components. Perhaps, we can create a new feature looking at this percentage in relation to the entire mixture in kg/m^3. Since all the predictor variables have the same units of measurement, we can easily calculate this.

In [382]:
#New extracted feature from 'age': 'agedays'

#Finding mode for age to see if the data agrees with the literature online. 
#The standard number of days it takes to make concrete according to literature is 28 days.
import statistics
print("Mode for age variable:")
print(statistics.mode(conc['age']))
#The mode shows that most of the cases in this dataset followed the standard number of concrete ageing days.
print("")
#Review the frequency of the other number of days.
print("Frequency for age variable:")
print(conc['age'].value_counts())

#After doing some research on how the number of days can improve compressive strength, I discovered 28 days is the arbitrary standard number of days.
#7 days equals 75% of concrete's compressive strength, which is considered the minumum accepted compressive strength.
#Less than 7 days, the strength is not satisfactory.
#More than 28 days allows for assured solidification.


#Re-categorizing 'age' and naming it 'agedays'
#conc['agedays'] = conc['age']
bins = [1, 6, 7, 27, 28, 365]
labels = ['1-6', '7', '8-27', '28', '29-365']
conc['agedays'] = pd.cut(conc.age, bins, labels = labels,include_lowest = True)
#Verifying the new categorized variable:
print(conc['agedays'].value_counts())
#Since this variable can be construed as ordinal, we can replace it with ordinval values
replaceStruct = {"agedays":     {"1-6": 1, "7": 2 ,"8-27": 3 , "28": 4, "29-365": 5}}
conc=conc.replace(replaceStruct)
conc.head(10)
Mode for age variable:
28

Frequency for age variable:
28     425
3      134
7      126
56      91
14      62
90      54
100     52
180     26
91      22
365     14
270     13
360      6
120      3
1        2
Name: age, dtype: int64
28        425
29-365    281
1-6       136
7         126
8-27       62
Name: agedays, dtype: int64
Out[382]:
cement slag ash water superplastic coarseagg fineagg age strength agedays
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89 4
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51 3
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22 4
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85 4
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29 4
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90 21.86 5
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7 15.75 2
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56 36.64 5
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28 21.65 4
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28 28.99 4
In [383]:
#Capturing the coarse aggregate to sand ratio. Standard cement:sand:agg ratio is 1:2:4. Dividing coarse aggregate by fine aggregate would normally result in 2.
conc['agg_sand_ratio'] = conc['coarseagg']/conc['fineagg'] # Standard should be 2. Between 1-2 suggests more rock than sand but not standard measurements. 1 means equal sand:rock measurement. Less than 1 suggests more sand to rock (normally for smaller delicate projects like making a fountain/)
conc.head(10)
Out[383]:
cement slag ash water superplastic coarseagg fineagg age strength agedays agg_sand_ratio
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89 4 1.298330
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51 3 1.357448
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22 4 1.111124
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85 4 1.391045
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29 4 1.503373
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90 21.86 5 0.941587
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7 15.75 2 1.408605
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56 36.64 5 1.357265
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28 21.65 4 1.418301
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28 28.99 4 1.258941
In [384]:
#Visualizing the new feature using distplot
sns.distplot(conc['agg_sand_ratio']);

The distplot for the agg_sand_ratio suggests that most projects did not follow standard mixing amounts of coarse to fine aggregate. Most cases still had more rock than sand (Cases > 1.0). There are fewer cases where the rock:sand ratio is < 1; those are most likely small projects that require more sand in the mixture.

In [385]:
#Water:Cement ratio
conc['water_cem_ratio'] = conc['water']/conc['cement']
conc.head(10)
Out[385]:
cement slag ash water superplastic coarseagg fineagg age strength agedays agg_sand_ratio water_cem_ratio
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89 4 1.298330 1.440198
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51 3 1.357448 0.937241
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22 4 1.111124 0.749600
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85 4 1.391045 0.857143
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29 4 1.503373 1.248708
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90 21.86 5 0.941587 0.752941
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7 15.75 2 1.408605 1.220024
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56 36.64 5 1.357265 0.749801
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28 21.65 4 1.418301 0.648649
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28 28.99 4 1.258941 1.251613
In [386]:
#Visualizing water:cement ratio
sns.distplot(conc['water_cem_ratio']);

Normally, this value is around 0.45-0.6. This distplot seems to reflect that. There are a lot of cases that are crowding in this 0.46-0.6 range, suggesting a lot of the cases in this dataset did follow for the most part standard water:cement mixing.

Now we should try to take care of those outliers in the different variables. This is part of EDA and feature engineering.

In [388]:
#Scale to Range Feature Engineering Technique to Create a New Feature from Water Column.
#This technique doesn't work well with outliers. So, we will have to find a way to deal with the outliers for the water variable.

#Imputing the whisker values in boxplot for the outliers in water.
conc[['water']].boxplot()

q75, q25 = np.percentile(conc['water'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
27.099999999999994
75th percentile
192.0
25th percentile
164.9
Upper whisker
232.64999999999998
Lower whisker
124.25000000000001
In [391]:
#RUN THIS CELL TWICE TO HANDLE OUTLIERS ON BOTH ENDS OF THE WHISKERS
#Imputing the outliers above the upper whisker with the upper whisker value
conc.water[conc['water']>232.64] = 232.6499
conc[['water']].boxplot()
#Imputing the outliers below the lower whisker with the lower whisker value. Run twice to make sure.
conc.water[conc['water']<124.25000000000002] = 124.25000000000001
conc[['water']].boxplot()
Out[391]:
<matplotlib.axes._subplots.AxesSubplot at 0x17615f13640>
In [392]:
#Imputing the whisker values in boxplot for the outliers in water.
conc[['slag']].boxplot()

q75, q25 = np.percentile(conc['slag'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
142.95
75th percentile
142.95
25th percentile
0.0
Upper whisker
357.375
Lower whisker
-214.42499999999998
In [393]:
#Imputing the outliers above the upper whisker with the upper whisker value
conc.slag[conc['slag']> 357.375] = 357.375
conc[['slag']].boxplot()
Out[393]:
<matplotlib.axes._subplots.AxesSubplot at 0x176182ecb20>
In [394]:
#Imputing the whisker values in boxplot for the outliers in water.
conc[['superplastic']].boxplot()

q75, q25 = np.percentile(conc['superplastic'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
10.2
75th percentile
10.2
25th percentile
0.0
Upper whisker
25.5
Lower whisker
-15.299999999999999
In [395]:
#Imputing the outliers above the upper whisker with the upper whisker value
conc.superplastic[conc['superplastic']> 25.5] = 25.5
conc[['superplastic']].boxplot()
Out[395]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761d223640>
In [396]:
#Imputing the whisker values in boxplot for the outliers in water.
conc[['fineagg']].boxplot()

q75, q25 = np.percentile(conc['fineagg'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
93.05000000000007
75th percentile
824.0
25th percentile
730.9499999999999
Upper whisker
963.575
Lower whisker
591.3749999999998
In [397]:
#Imputing the outliers above the upper whisker with the upper whisker value
conc.fineagg[conc['fineagg']> 963.575] = 963.575
conc[['fineagg']].boxplot()
Out[397]:
<matplotlib.axes._subplots.AxesSubplot at 0x17615352c40>
In [398]:
#Imputing the whisker values in boxplot for the outliers in water.
conc[['age']].boxplot()

q75, q25 = np.percentile(conc['age'], [75 ,25])
iqr = q75 - q25

#Interquartile range:
print("Interquartile range:")
print(iqr)
print("75th percentile")
print(q75)
print("25th percentile")
print(q25)
print("Upper whisker")
print(q75 + 1.5*iqr)
print("Lower whisker")
print(q25 - 1.5*iqr)
Interquartile range:
49.0
75th percentile
56.0
25th percentile
7.0
Upper whisker
129.5
Lower whisker
-66.5
In [399]:
#Imputing the outliers above the upper whisker with the upper whisker value
conc.age[conc['age']> 129.5] = 129.5
conc[['age']].boxplot()
Out[399]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761c67a160>
In [401]:
#Making the new feature for superplasticizer percent.

conc['superp_percent'] = conc['superplastic']/(conc['cement'] + conc['slag'] + conc['ash'] + conc['water'] + conc['superplastic'] + conc['coarseagg'] + conc['fineagg'])*100
conc.head(10)
Out[401]:
cement slag ash water superplastic coarseagg fineagg age strength agedays agg_sand_ratio water_cem_ratio superp_percent
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28.0 29.89 4 1.298330 1.440198 0.000000
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14.0 23.51 3 1.357448 0.937241 0.453496
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28.0 29.22 4 1.111124 0.749600 0.233377
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28.0 45.85 4 1.391045 0.857143 0.000000
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28.0 18.29 4 1.503373 1.248708 0.398302
5 255.0 0.0 0.0 192.0 0.0 889.8 945.0 90.0 21.86 5 0.941587 0.752941 0.000000
6 166.8 250.2 0.0 203.5 0.0 975.6 692.6 7.0 15.75 2 1.408605 1.220024 0.000000
7 251.4 0.0 118.3 188.5 6.4 1028.4 757.7 56.0 36.64 5 1.357265 0.749801 0.272259
8 296.0 0.0 0.0 192.0 0.0 1085.0 765.0 28.0 21.65 4 1.418301 0.648649 0.000000
9 155.0 184.0 143.0 194.0 9.0 880.0 699.0 28.0 28.99 4 1.258941 1.251613 0.397527
In [402]:
sns.distplot(conc['superp_percent'])
Out[402]:
<matplotlib.axes._subplots.AxesSubplot at 0x1761c97eb80>

1. Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks)

It's important to note that certain regression models are very sensitive to multicollinearity, outliers, and non-standardized/non-normalized data columns. These models include: linear regression, Support Vector Regression, and K Nearest Neighbors. Tree models like Decision Tree and Random Forest are robust enough, where these factors don't really influence the measures. Bagging, and boosting models whose base estimators are decision tree are also robust enough.

I created a pipeline for linear regression, support vector regression and K Nearest Neighbors that contain a scaler(Standardscaler)/normalizer(MinMaxScaler) that will standardize/normalize the data.

Accuracy for a regression model is the R^2 value. Just like measuring accuracy in a classification problem, R^2 is not the most ideal metric to understand how well the model fits the data. R Square can determine how well the model fits the dependent variables and explains how much of the variance in the dependent variable can be explained by the independent variable(s) in the model. However, it does not take into consideration the issue of overfitting. Mean Squared Error, on the hand, is better at measuring a model's goodness of fit and reports underfitting and overfitting. Both mean squared error and root mean square share the same purpose, and in this assignment mean squared error will be measured.

Overall, both R^2 and mean squared error and mean absolute error measure a model's statistical robustness.

LINEAR REGRESSION BASIC MODEL:

In [403]:
#Importing all the necessary packages for our model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
import warnings
warnings.filterwarnings('ignore')

#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]

#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)

#The problem with the linear regression model is that the independent variables have to be standardized so that each variable is on the same scale to limit bias towards features with larger value ranges. 
#All the original variables (except for age) all have the same units of measurement and would normally not need to be standardized. 
#Because we converted age to an ordinal variable and it is no longer continuous, ordinal variables do not need to be standardized.
#HOWEVER, because I created all these new features that are ratios, and percentages, and are no longer in the same scale as the other continuous variables, using a scaler is important so that results from the linear regression model are not biased towards the features with the higher range values.

#Create pipeline that has standardization scaler. 
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', LinearRegression())
])

#Fit the model created from pipeline onto the training data.
pipeline.fit(x_train, y_train)



for idx, col_name in enumerate(x_train.columns):
    print("The coefficient for {} is {}".format(col_name, pipeline['clf'].coef_[0][idx]))
The coefficient for cement is 12.156184410930297
The coefficient for slag is 8.058761346145825
The coefficient for ash is 4.882719756287875
The coefficient for water is -1.8415832196362958
The coefficient for superplastic is 64.3825420610818
The coefficient for coarseagg is 5.764183438974145
The coefficient for fineagg is -5.061753699206828
The coefficient for agedays is 10.393514196189281
The coefficient for agg_sand_ratio is -8.782039997050243
The coefficient for water_cem_ratio is -0.5035254199048541
The coefficient for superp_percent is -63.539514505925474

From this list of linear regresssion predicted coefficients, the newly made water_cem_ratio is the least contributive to the model, and may need to be eliminated in the revised version of the model after tuning the hyperparameters. The superplasticizer percentage variable is the most contributive of all features, followed by cement, and then by agedays (another new feature that captures the order of concrete ageing day periods.)

In [404]:
#R^2 value for training data
print("R-squared value for training data")
print(pipeline.score(x_train, y_train)) #Score for linear regression returns the R^2 coefficient (according to sklearn documentation)
print("")


#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
lrscores = cross_val_score(pipeline, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(lrscores)  
print("Average of all R^2 iterations and 2 standard deviations")
print("R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2))


#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
lr_concrete_strength_predict = pipeline.predict(x_train)
print("Mean Squared Error for our Training data")
print(mean_squared_error(y_train, lr_concrete_strength_predict))
print("Mean Squared Error for our Validation data")
print(mean_squared_error(y_val, pipeline.predict(x_val)))
R-squared value for training data
0.8312567290441221

R^2 results for 10 iterations
[0.49500121 0.88748962 0.81412736 0.64021046 0.83118787 0.89682643
 0.71819282 0.85588302 0.87534982 0.84404998 0.82500785 0.73387903
 0.89480423 0.72256119 0.89039852 0.7468181  0.79179355 0.84320081
 0.8367311  0.86729432 0.82581244 0.81731402 0.75455167 0.61305814
 0.81260572 0.61331612 0.58807904 0.66572947 0.79207257 0.75600902
 0.90613146 0.64534671 0.75560225 0.86538561 0.83663782 0.76525066
 0.82212158 0.860319   0.8304249  0.68294217 0.6613802  0.72482818
 0.90642822 0.92492303 0.77060533 0.90282042 0.72740158 0.78238779
 0.89935634 0.73020525]
Average of all R^2 iterations and 2 standard deviations
R^2: 0.834 (0.192)
Mean Squared Error
Calculate mean squared error for training data
Mean Squared Error for our Training data
44.32846059672197
Mean Squared Error for our Validation data
51.49516499111835

Looking at the results, R-squared values are close for the training and validation data. The mean squared error for training data was 44.3, and the mean squared error on the validation set was 51.5 on the validation set using k-fold cross-validation. From this, we can see there is a little bit of overfitting. Looking at the standard deviation from cross-validation, R^2 from the training data falls in the 95% confidence interval (calculated from the 2 standard deviations) from cross-validation.

This suggests that the model is tolerable in its goodness of fit, but there is still some overfitting issues that could be resolved by reducing some of the less important polynomial features in the linear equation.

In [405]:
#Heatmap to Measure Correlation (including new features)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conc.corr(), annot = True);

The heatmap shows that the new features (water_cem_ratio, and agg_sand_ratio) are highly correlated (>0.8). For this reason, they should be eliminated from the linear regression model because linear regression does not handle multicollinearity well. The water to cement ratio also does not contribute much towards the linear model from the features of importance list, so that would be another reason to remove this variable from the model.

In [406]:
#Model Tuning for linear regression

#Separate independent features from target feature.
X = conc.drop(['age','strength','water_cem_ratio','agg_sand_ratio'], axis=1) #Removing features that are less contributive and reducing collinearity.
# the dependent variable
Y = conc[['strength']]

#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)

#The problem with the linear regression model is that the independent variables have to be standardized so that each variable is on the same scale to limit bias towards features with larger value ranges. 
#All the original variables (except for age) all have the same units of measurement and would normally not need to be standardized. 
#Because we converted age to an ordinal variable and it is no longer continuous, ordinal variables do not need to be standardized.
#HOWEVER, because I created all these new features that are ratios, and percentages, and are no longer in the same scale as the other continuous variables, using a scaler is important so that results from the linear regression model are not biased towards the features with the higher range values.

#Create pipeline that has standardization scaler. 
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', LinearRegression())
])

#Fit the model created from pipeline onto the training data.
pipeline.fit(x_train, y_train)



for idx, col_name in enumerate(x_train.columns):
    print("The coefficient for {} is {}".format(col_name, pipeline['clf'].coef_[0][idx]))
The coefficient for cement is 12.097569191292479
The coefficient for slag is 7.84415368949313
The coefficient for ash is 5.000957325455943
The coefficient for water is -2.1477894983257686
The coefficient for superplastic is 65.35989803076207
The coefficient for coarseagg is 0.9898676980911577
The coefficient for fineagg is 1.1996543853616928
The coefficient for agedays is 10.35224935899362
The coefficient for superp_percent is -64.69308876173696

Interpretation of the 2 most contributive coefficients:

For every superplasticizer percent increase, there is a decrease of 64.7 MPa in concrete strength. For every 1kg/m^3 increase in cement, there is a 12.1 MPA increase in concrete strength.

LINEAR REGRESSION AFTER HYPERPARAMETER TUNING

In [407]:
#R^2 value for training data
print("R-squared value for training data")
R2_train= pipeline.score(x_train, y_train) #Score for linear regression returns the R^2 coefficient (according to sklearn documentation)
print(R2_train)

#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
lrscores = cross_val_score(pipeline, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(lrscores)  
print("Average of all R^2 iterations and 2 standard deviations")
R2_val_cross_val = "R^2: %.3f (%.3f)" % (lrscores.mean(), lrscores.std()*2)
print(R2_val_cross_val)

#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
lr_concrete_strength_predict = pipeline.predict(x_train)
print("Mean Squared Error for our Training data")
MSE_train = mean_squared_error(y_train, lr_concrete_strength_predict)
print(MSE_train)
print("Mean Squared Error for our Validation data")
MSE_val = mean_squared_error(y_val, pipeline.predict(x_val))
print(MSE_val)
print("")
R-squared value for training data
0.825892037097684
R^2 results for 10 iterations
[0.43750126 0.8705789  0.80487896 0.59988899 0.81320912 0.88575577
 0.75554468 0.85071066 0.8617748  0.86095272 0.81619182 0.70499163
 0.88408804 0.70537341 0.88633678 0.75710736 0.81360468 0.84833745
 0.81904081 0.86696338 0.82000528 0.80964409 0.79088474 0.65472647
 0.83309979 0.6488868  0.5685653  0.70891859 0.76478238 0.71156175
 0.90183114 0.66380675 0.77041145 0.85083585 0.82686806 0.7623406
 0.80555048 0.84520852 0.82699928 0.64176434 0.6730116  0.68555971
 0.8997524  0.91223488 0.83809611 0.87629162 0.71189405 0.80514788
 0.89291275 0.7387858 ]
Average of all R^2 iterations and 2 standard deviations
R^2: 0.782 (0.196)
Mean Squared Error
Calculate mean squared error for training data
Mean Squared Error for our Training data
45.7377525596792
Mean Squared Error for our Validation data
51.40988820314021

We reduced the overfitting by a little, but it seems that there will be a little overfitting nonetheless. The mean squared error is still high for training and from cross-validation, suggesting that linear regression may not be the best fit model for this data, despite the ability to reduce the overfitting by a little. We can now test the model on the test data.

In [408]:
print("R-squared for test data")
R2_test = pipeline.score(x_test,y_test)
print(R2_test)

print("MSE of model on test data")
MSE_test = mean_squared_error(y_test, pipeline.predict(x_test))
print(MSE_test)
R-squared for test data
0.8292322168089732
MSE of model on test data
50.285855723347446

Clearly, the R^2 from the test data is within the range of the 95% confidence interval (0.586-0.978) that we achieved in the cross-validation. So, even though the model is a little overfit, linear regression is still a tolerable model (not the best fit model) to carry into production for predicting the strength of concrete with the more important features.

In [409]:
Linear_Regression ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Linear_Regression':[R2_train, R2_val_cross_val, R2_test, MSE_train, MSE_val, MSE_test]}
Linear_Regression_= pd.DataFrame(Linear_Regression)
Linear_Regression_
Out[409]:
Metrics Linear_Regression
0 R2_train 0.825892
1 R2_val_cross_val (+2 sd) R^2: 0.782 (0.196)
2 R2_test 0.829232
3 MSE_train 45.7378
4 MSE_val 51.4099
5 MSE_test 50.2859
In [410]:
#Reset the x dataset back to the form before the linear regression model tuning because tree models like decision tree and random forest are robust.

#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]

#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)

DECISION TREE BASIC MODEL

In [411]:
#Decision Tree
#Decision tree doesn't require any standardization because the model is robust to the magnitude of variables.
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
dTree = DecisionTreeRegressor(random_state=1, criterion='mse', splitter='best', max_depth=10, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort='deprecated', ccp_alpha=0.0)
dTree.fit(x_train, y_train)
Out[411]:
DecisionTreeRegressor(max_depth=10, random_state=1)
In [412]:
print("Decision Tree R^2 value for training data")
dt_R2_train= dTree.score(x_train,y_train)#"Score" is the coefficient of determination R^2 of the prediction for DecisionTreeRegression, according to sklearn
print(dt_R2_train)


#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
dtscores = cross_val_score(dTree, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(dtscores)  
print("Average of all R^2 iterations and 2 standard deviation")
dt_R2_val_cross_val= "R^2: %.3f (%.3f)" % (dtscores.mean(), dtscores.std()*2)
print(dt_R2_val_cross_val)


#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
print("Calculate mean squared error for training data")
dt_concrete_strength_predict = dTree.predict(x_train)
print("Mean Squared Error for our Training data")
dt_MSE_train= mean_squared_error(y_train, dt_concrete_strength_predict)
print(dt_MSE_train)
print("Mean Squared Error for Validation Data")
dt_MSE_val= mean_squared_error(y_val, dTree.predict(x_val))
print(dt_MSE_val)
Decision Tree R^2 value for training data
0.9845081381122976
R^2 results for 10 iterations
[0.66732121 0.86491564 0.9362959  0.89443349 0.42735141 0.94152559
 0.74577406 0.83734122 0.86021603 0.73433346 0.76683349 0.44488648
 0.8224007  0.79638105 0.77523077 0.80242703 0.77933661 0.91420223
 0.87935185 0.82708626 0.96427124 0.40548425 0.27621146 0.67530468
 0.85192959 0.17241255 0.80005393 0.7690281  0.96303399 0.79788153
 0.91718918 0.76031192 0.88795739 0.92748808 0.861259   0.839145
 0.90822    0.58365673 0.77684555 0.88152364 0.75620961 0.65166122
 0.82959368 0.87207969 0.93173575 0.72025565 0.79484985 0.62426214
 0.9615928  0.7286571 ]
Average of all R^2 iterations and 2 standard deviation
R^2: 0.772 (0.341)
Calculate mean squared error for training data
Mean Squared Error for our Training data
4.069675699473874
Mean Squared Error for Validation Data
48.15283769446304

This is definitely an overfit model. The difference between the R^2 values may not be that much different (0.996 vs. 0.823) but the difference is stark between the mean squared error for the training data and the negative mean squared error from the validation set. Overfitting the training data makes the mean squared error approach 0,and the mean squared error from the validation set is so far from 0. The confidence interval calculated from cross validation, despite its negative sign, does not even cover the training MSE.

We must prune the decision tree model or tune the hyperparameters. The most influential hyperparameters that reduce overfitting for a decision tree model are: max_depth and min_impurity_decrease. To do this, we would ideally want to decrease the max depth and increase the min_impurity_decrease. Since max_depth was set to none in our decision tree model from before, we should decrease it to 6 and increase the min_impurity_decrease to 1. These results were generated by hyperparameter tuning based off of training and validation prediction R^2 scores and the MSE scores from cross validation.

In [413]:
dTree = DecisionTreeRegressor(max_depth = 6, random_state=1, min_impurity_decrease=1, min_samples_leaf=2)
dTree.fit(x_train, y_train)

#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

print("Decision Tree R^2 value for training data")
dt_R2_train= dTree.score(x_train,y_train)#"Score" is the coefficient of determination R^2 of the prediction for DecisionTreeRegression, according to sklearn
print(dt_R2_train)


#Cross-validation for R^2 on Validation Set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
dtscores = cross_val_score(dTree, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(dtscores)  
print("Average of all R^2 iterations and 2 standard deviation")
dt_R2_val_cross_val= "R^2: %.3f (%.3f)" % (dtscores.mean(), dtscores.std()*2)
print(dt_R2_val_cross_val)


#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
print("Calculate mean squared error for training data")
dt_concrete_strength_predict = dTree.predict(x_train)
print("Mean Squared Error for our Training data")
dt_MSE_train= mean_squared_error(y_train, dt_concrete_strength_predict)
print(dt_MSE_train)
print("Mean Squared Error for Validation Data")
dt_MSE_val= mean_squared_error(y_val, dTree.predict(x_val))
print(dt_MSE_val)





#scores2 = cross_val_score(dTree, x_val, y_val, scoring='r2', cv=folds)
#print(scores2)

#print("R^2: %.3f (%.3f)" % (scores2.mean(), scores2.std()))

#print(dTree.score(x_test, y_test))
Decision Tree R^2 value for training data
0.8272750486723355
R^2 results for 10 iterations
[0.28902517 0.87090534 0.8193936  0.75043535 0.45731933 0.91210529
 0.58079906 0.7947995  0.72103025 0.72959973 0.69360609 0.57248136
 0.80290432 0.47663619 0.80702032 0.55607129 0.60290424 0.84912458
 0.67082754 0.87099285 0.8671913  0.74188531 0.17902058 0.83276509
 0.7349286  0.55404574 0.61624689 0.6891639  0.91497365 0.39633155
 0.7792286  0.39033547 0.85395616 0.8355274  0.74241703 0.71251322
 0.78988822 0.75463169 0.83509389 0.81199121 0.69810994 0.83490425
 0.75472382 0.8112971  0.6204671  0.85986869 0.53500205 0.74970594
 0.92108858 0.15400186]
Average of all R^2 iterations and 2 standard deviation
R^2: 0.696 (0.361)
Calculate mean squared error for training data
Mean Squared Error for our Training data
45.374438670215845
Mean Squared Error for Validation Data
55.5331690417149

The R^2 isn't the best method to evaluate overfitting. Looking at the R^2 values for this newly revised model, the R^2 values for the training and cross validation are closer. The mean squared errors for the validation set and the training set are not that far apart either, so there is just a little bit of overfitting. No matter how much I prune and tune the hyperparameters, it will still be a little overfit. However, we tuned the model so that the MSE for the training is not so close to 0, and is able to be generalizable to the validation set, and possibly the test set. The model is decent to carry out to production, if the R^2 values displayed here are considered somewhat acceptable.

In [414]:
print("R-squared for test data")
dt_R2_test = dTree.score(x_test,y_test)
print(dt_R2_test)

print("MSE of model on test data")
dt_MSE_test = mean_squared_error(y_test, dTree.predict(x_test))
print(dt_MSE_test)
R-squared for test data
0.796746880002215
MSE of model on test data
59.851787477357746

After testing on the testing data, the R^2 is within the acceptable 95% Confidence interval but the model is still somewhat overfit because the MSE is slightly more than the training and the validation test set.

In [415]:
Decision_Tree ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Decision_Tree':[dt_R2_train, dt_R2_val_cross_val, dt_R2_test, dt_MSE_train, dt_MSE_val, dt_MSE_test]}
Decision_Tree_= pd.DataFrame(Decision_Tree)
Decision_Tree_
Out[415]:
Metrics Decision_Tree
0 R2_train 0.827275
1 R2_val_cross_val (+2 sd) R^2: 0.696 (0.361)
2 R2_test 0.796747
3 MSE_train 45.3744
4 MSE_val 55.5332
5 MSE_test 59.8518
In [ ]:
 

RANDOM FOREST

In [416]:
#Random Forest
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state = 100, n_estimators=100, max_depth=None, bootstrap=True)
rfr.fit(x_train, y_train)
#R-Square for Training Data
print("R-squared for training data")
rf_R2_train = rfr.score(x_train,y_train)
print(rf_R2_train)

#Cross-validation for R^2.
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
rfscores = cross_val_score(rfr, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(rfscores)  
print("Average of all R^2 iterations and 2 standard deviations")
rf_R2_val_cross_val= "R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2)
print(rf_R2_val_cross_val)
print("")

#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
rf_concrete_strength_predict = rfr.predict(x_train)
print("Mean Squared Error for our Training data")
rf_MSE_train = mean_squared_error(y_train, rf_concrete_strength_predict)
print(rf_MSE_train)

print("Mean Squared Error for our Validation data")
rf_MSE_val= mean_squared_error(y_val, rfr.predict(x_val))
print(rf_MSE_val)
R-squared for training data
0.9836715318698851
R^2 results for 10 iterations
[0.6896989  0.91720805 0.95472043 0.9301504  0.89653824 0.94704627
 0.91041333 0.93374777 0.86649732 0.91760436 0.78582571 0.77104953
 0.91648106 0.86930442 0.85282187 0.8470381  0.90680358 0.94568716
 0.9274303  0.89318936 0.97052765 0.8447633  0.56352674 0.8082026
 0.87697767 0.7043279  0.92398224 0.86431879 0.97036973 0.92352806
 0.9117169  0.87409063 0.9502558  0.95127327 0.95075582 0.8564797
 0.96304259 0.87164297 0.86050149 0.95762989 0.82625561 0.86748879
 0.94080117 0.95298587 0.88022116 0.93827083 0.82516207 0.76867072
 0.9639309  0.74735162]
Average of all R^2 iterations and 2 standard deviations
R^2: 0.880 (0.163)

Mean Squared Error
Calculate mean squared error for training data
Mean Squared Error for our Training data
4.289450192653249
Mean Squared Error for our Validation data
28.02296003682836

Based on these validation results, the R^2 value went down compared to the R^2 value for the training set. This shows that the model is not able to capture as much of the variance around the mean in the validation set.

The overfitting issue is corroborated by the values of the mean squared error, which is a better measure to examine goodness-of-fit for the model. The mean squared error was 4.3 on the training data but is 28 for the validation data. This is a stark difference, explaining how much the model is overfit. The fact how the training MSE is so close to 0 means that the data is so well-trained to the training data.

I will tune the hyperparameters to make the model more generalizable so that it may perform decently on the test data. To prevent overfitting in a random forest model, it is advised to increase the n_estimators (trees), and decrease the max_depth.

In [417]:
rfr = RandomForestRegressor(random_state = 100, n_estimators=120, max_depth=6)
rfr.fit(x_train, y_train)


#R-Square for Training Data
print("R-squared for training data")
rf_R2_train = rfr.score(x_train,y_train)
print(rf_R2_train)

#Cross-validation for R^2.
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
rfscores = cross_val_score(rfr, x_train, y_train, scoring='r2', cv=folds)
print("R^2 results for 10 iterations")
print(rfscores)  
print("Average of all R^2 iterations and 2 standard deviations")
rf_R2_val_cross_val= "R^2: %.3f (%.3f)" % (rfscores.mean(), rfscores.std()*2)
print(rf_R2_val_cross_val)
print("")

#Mean Squared Error:
print("Mean Squared Error")
#Mean squared error is to evaluate goodness of fit of the model on the training and validation data.
from sklearn.metrics import mean_squared_error
#from sklearn.model_selection import cross_val_score
print("Calculate mean squared error for training data")
rf_concrete_strength_predict = rfr.predict(x_train)
print("Mean Squared Error for our Training data")
rf_MSE_train = mean_squared_error(y_train, rf_concrete_strength_predict)
print(rf_MSE_train)

print("Mean Squared Error for our Validation data")
rf_MSE_val= mean_squared_error(y_val, rfr.predict(x_val))
print(rf_MSE_val)
R-squared for training data
0.9311049520465651
R^2 results for 10 iterations
[0.54525773 0.90095664 0.90610546 0.87298828 0.85274452 0.93929626
 0.85210048 0.90711975 0.85123075 0.8059539  0.78598554 0.77200947
 0.89672233 0.83677191 0.81899107 0.75728639 0.82121913 0.92269417
 0.8706598  0.88926478 0.90234723 0.81409264 0.48962053 0.80484296
 0.8435486  0.5758484  0.76487369 0.83389207 0.94615976 0.8317107
 0.898843   0.8092331  0.92227237 0.92570881 0.9147003  0.82607382
 0.89325272 0.81325203 0.839608   0.92057889 0.78954688 0.87402233
 0.90824758 0.89703667 0.80891014 0.93647746 0.75366215 0.7413066
 0.93120596 0.68233396]
Average of all R^2 iterations and 2 standard deviations
R^2: 0.834 (0.192)

Mean Squared Error
Calculate mean squared error for training data
Mean Squared Error for our Training data
18.09856713819219
Mean Squared Error for our Validation data
36.10356744096461

Compared to the model prior to hypertuning, this model is not as overfit. The difference between the mean squared errors is now only 16. This is the best random forest model obtained juggling with the tradeoffs of each hyperparameter. The increase in the training MSE suggests that the model is more generalizable now after hypertuning compared to the previous overfit model.

In [418]:
#Testing on the test data

print("R-squared for test data")
rf_R2_test = rfr.score(x_test,y_test)
print(rf_R2_test)

print("MSE of model on test data")
rf_MSE_test = mean_squared_error(y_test, rfr.predict(x_test))
print(rf_MSE_test)
R-squared for test data
0.8684480655657794
MSE of model on test data
38.737995372853696

The MSE for the testing data is a lot closer to the validation set MSE. The model is overfit. I tuned the parameters that are most influential for Random Forest to the best of my ability but the model still remains somewhat overfit.

In [419]:
Random_Forest ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Random_Forest':[rf_R2_train, rf_R2_val_cross_val, rf_R2_test, rf_MSE_train, rf_MSE_val, rf_MSE_test]}
Random_Forest_= pd.DataFrame(Random_Forest)
Random_Forest_
Out[419]:
Metrics Random_Forest
0 R2_train 0.931105
1 R2_val_cross_val (+2 sd) R^2: 0.834 (0.192)
2 R2_test 0.868448
3 MSE_train 18.0986
4 MSE_val 36.1036
5 MSE_test 38.738

SVR

Support vector regression is also a possible regression choice to determine the strength of concrete. For this model, like linear regression, the dataframe has to be standardized so that all features are of the same scale and similar ranges. Additionally, we must re-remove features of multicollinearity like we did in linear regression.

In [420]:
#Remove the features of multicollinearity (like what we did for the regression model)

#Separate independent features from target feature.
X = conc.drop(['age','strength','water_cem_ratio','agg_sand_ratio'], axis=1) #Removing features that are less contributive and reducing collinearity.
# the dependent variable
Y = conc[['strength']]

#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
In [425]:
###### SVM
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from sklearn import decomposition, datasets

pipeline_svr = Pipeline([
    ('scaler',MinMaxScaler()),
    ('sv', SVR())
])

print("SVR R^2 training data")
pipeline_svr.fit(x_train,y_train)
SVR(C = 1, degree = 3, gamma=0.22, kernel = 'rbf', max_iter = -1, verbose=False) #Default settings
svr_R2_train = pipeline_svr.score(x_train,y_train)
print(svr_R2_train)
#R2 for SVR training set.
print("SVR R^2 cross validation")
print(svr_R2_train)




folds = KFold(n_splits = 10, shuffle = True, random_state = 100)
svrscores = cross_val_score(pipeline_svr, x_train, y_train, scoring='r2', cv=folds)
print(svrscores)  
#Cross-validation R2 mean, and 2 standard deviations
svr_R2_val_cross_val= "R^2: %.3f (%.3f)" % (svrscores.mean(), svrscores.std()*2)
print(svr_R2_val_cross_val)

#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
svr_concrete_strength_predict = pipeline_svr.predict(x_train)
print("Mean Squared Error for our Training data")
svr_MSE_train = mean_squared_error(y_train, svr_concrete_strength_predict)
print(svr_MSE_train)
#Mean Squared Errors for Validation Set
print("Mean Squared Error for our Validation data")
svr_MSE_val = mean_squared_error(y_val, pipeline_svr.predict(x_val))
print(svr_MSE_val)
SVR R^2 training data
0.740118047269592
SVR R^2 cross validation
0.740118047269592
[0.70644071 0.70243761 0.61698167 0.68293162 0.71904871 0.68442849
 0.79439835 0.66701733 0.77276578 0.70946812]
R^2: 0.706 (0.096)
Calculate mean squared error for training data
Mean Squared Error for our Training data
68.2703780491566
Mean Squared Error for our Validation data
91.28737767960746

The R^2 value is not as high as the other models. The mean square error is really far from 0. There is overfitting, and after trying to tune the hyperparameters, the training and validation values don't seem to change. Perhaps, this is a good model to use for GridSearch CV.

In [426]:
#Testing on the test data

print("R-squared for test data")
svr_R2_test = pipeline_svr.score(x_test,y_test)
print(svr_R2_test)

print("MSE of model on test data")
svr_MSE_test = mean_squared_error(y_test, pipeline_svr.predict(x_test))
print(svr_MSE_test)
R-squared for test data
0.7584679599456011
MSE of model on test data
71.1237511653749
In [449]:
SVR ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'SVR':[svr_R2_train, svr_R2_val_cross_val, svr_R2_test, svr_MSE_train, svr_MSE_val, svr_MSE_test]}
SVR_= pd.DataFrame(SVR)
SVR_
Out[449]:
Metrics SVR
0 R2_train 0.740118
1 R2_val_cross_val (+2 sd) R^2: 0.706 (0.096)
2 R2_test 0.758468
3 MSE_train 68.2704
4 MSE_val 91.2874
5 MSE_test 71.1238

K Nearest Neighbors

In [430]:
#K-Nearest Neighbors
from sklearn.neighbors import KNeighborsRegressor
pipeline_knn = Pipeline([
    ('scaler',MinMaxScaler()),
    ('knr', KNeighborsRegressor())
])

pipeline_knn.fit(x_train,y_train)
KNeighborsRegressor(n_neighbors=100, p=2)
#KNN R2 Training
print("KNN R2 for Training data")
knn_R2_train = pipeline_knn.score(x_train,y_train)
print(knn_R2_train)

#K-Fold cross-validation for R2
folds = KFold(n_splits = 10, shuffle = True, random_state = 100)
knnscores = cross_val_score(pipeline_knn, x_train, y_train, scoring='r2', cv=folds)
print(knnscores) 
print("KNN, R2, cross-validation")
knn_R2_val_cross_val = "R^2: %.3f (%.3f)" % (knnscores.mean(), knnscores.std())
print(knn_R2_val_cross_val)

#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
knn_concrete_strength_predict = pipeline_knn.predict(x_train)
print("Mean Squared Error for our Training data")
knn_MSE_train = mean_squared_error(y_train, knn_concrete_strength_predict)
print(knn_MSE_train)
print("Mean Squared Error for our Validation data")
knn_MSE_val = mean_squared_error(y_val, pipeline_knn.predict(x_val))
print(knn_MSE_val)
KNN R2 for Training data
0.8846792072976158
[0.80728762 0.81175462 0.69937797 0.71991783 0.8345524  0.70126887
 0.84332412 0.81760486 0.85250545 0.85546211]
KNN, R2, cross-validation
R^2: 0.794 (0.059)
Calculate mean squared error for training data
Mean Squared Error for our Training data
30.294501145631067
Mean Squared Error for our Validation data
59.42099811650485

K Nearest Neighbors model is pretty good for this data, despite high overfitting. The R2 value is good, and can perhaps be improved if we ran GridSearch CV on this. Like SVR, tuning the hyperameters manually did not have much of a difference on controlling the overfitting of the model to the data.

In [431]:
#Testing on the test data

print("R-squared for test data")
knn_R2_test = pipeline_knn.score(x_test,y_test)
print(knn_R2_test)

print("MSE of model on test data")
knn_MSE_test = mean_squared_error(y_test, pipeline_knn.predict(x_test))
print(knn_MSE_test)
R-squared for test data
0.8296745642449213
MSE of model on test data
50.155598019417475
In [497]:
KNN ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'K Nearest_Neighbors':[knn_R2_train, knn_R2_val_cross_val, knn_R2_test, knn_MSE_train, knn_MSE_val, knn_MSE_test]}
KNN_= pd.DataFrame(KNN)
KNN_
Out[497]:
Metrics K Nearest_Neighbors
0 R2_train 0.884679
1 R2_val_cross_val (+2 sd) R^2: 0.794 (0.059)
2 R2_test 0.829675
3 MSE_train 30.2945
4 MSE_val 59.421
5 MSE_test 50.1556
In [498]:
#Reset the x dataset back to the form before the linear regression model tuning because tree models like decision tree and random forest are robust.

#Separate independent features from target feature.
X = conc.drop(['age','strength'], axis=1)
# the dependent variable
Y = conc[['strength']]

#Making the test set 20% of the entire dataframe
x_remaining, x_test, y_remaining, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

#Making the validation set 20% of the entire dataframe (25% of the remaining data from the last split)
x_train, x_val, y_train, y_val = train_test_split(x_remaining, y_remaining, test_size=0.25, random_state=1)
In [499]:
#Bagging Model:

from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
bagr = BaggingRegressor(n_estimators=100, random_state=1)  # Base estimator is DecisionTree


bagr = bagr.fit(x_train, y_train)
print("R2 Training Set")
bag_R2_train = bagr.score(x_train, y_train)
print(bag_R2_train)

#Cross-validation of R2 value on validation set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
bagscores = cross_val_score(bagr, x_train, y_train, scoring='r2', cv=folds)
print(bagscores)
print("Cross-Validation R2")
bag_R2_val_cross_val = "R^2: %.3f (%.3f)" % (bagscores.mean(), bagscores.std())
print(bag_R2_val_cross_val)
print("")

#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
bag_concrete_strength_predict = bagr.predict(x_train)
print("Mean Squared Error for our Training data")
bag_MSE_train= mean_squared_error(y_train, bag_concrete_strength_predict)
print(bag_MSE_train)
print("Mean Squared Error for our Validation data")
bag_MSE_val= mean_squared_error(y_val, bagr.predict(x_val))
print(bag_MSE_val)
R2 Training Set
0.9837707214415985
[0.68658224 0.9154693  0.95748554 0.93224564 0.88807742 0.95206685
 0.90634745 0.93637376 0.85994538 0.89854458 0.81738289 0.74933175
 0.9027082  0.89727068 0.83905986 0.85933628 0.88940489 0.95136151
 0.94068931 0.89933231 0.97267986 0.86637439 0.51077139 0.80221006
 0.8690555  0.75268774 0.91430265 0.87060934 0.96363483 0.91814611
 0.91351308 0.84473803 0.96460188 0.96529085 0.93706062 0.85386034
 0.96003057 0.84814449 0.86025824 0.95948652 0.82604849 0.87303513
 0.92876584 0.96039847 0.88409817 0.92929369 0.84356786 0.79395554
 0.96914715 0.76589699]
Cross-Validation R2
R^2: 0.880 (0.083)

Calculate mean squared error for training data
Mean Squared Error for our Training data
4.263393325339971
Mean Squared Error for our Validation data
28.548922897223704
In [500]:
print("Bagging Model R^2 Value for Test Data")
bag_R2_test = bagr.score(x_test, y_test)
print(bag_R2_test)
print("Bagging Model MSE Value for Test Data")
bag_MSE_test = mean_squared_error(y_test, bagr.predict(x_test))
print(bag_MSE_test)
Bagging Model R^2 Value for Test Data
0.9063471504025455
Bagging Model MSE Value for Test Data
27.577881465322047

After manually attempting to tune the hyperparameters, the model and results do not change much. The n_estimators, whether 100 or 1000 did not show any difference.

In [501]:
Bagging ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Bagging':[bag_R2_train, bag_R2_val_cross_val, bag_R2_test, bag_MSE_train, bag_MSE_val, bag_MSE_test]}
Bagging_= pd.DataFrame(Bagging)
Bagging_
Out[501]:
Metrics Bagging
0 R2_train 0.983771
1 R2_val_cross_val (+2 sd) R^2: 0.880 (0.083)
2 R2_test 0.906347
3 MSE_train 4.26339
4 MSE_val 28.5489
5 MSE_test 27.5779
In [502]:
#Gradient Boost Model


from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
gbr = GradientBoostingRegressor(n_estimators=100, random_state=1)  # Base estimator is DecisionTreeRegressor, max_depth=3
gbr = gbr.fit(x_train, y_train)
print("R2 training")
gb_R2_train = gbr.score(x_train, y_train)
print(gb_R2_train)
#Cross-validation of R2 value on validation set
folds = KFold(n_splits = 50, shuffle = True, random_state = 100)
gradbscores = cross_val_score(gbr, x_train, y_train, scoring='r2', cv=folds)
print(gradbscores)  
gb_R2_val_cross_val= "R^2: %.3f (%.3f)" % (gradbscores.mean(), gradbscores.std())
print(gb_R2_val)
print("")

#Mean Squared Errors (MSE) for Training Set
print("Calculate mean squared error for training data")
gradb_concrete_strength_predict = gbr.predict(x_train)
print("Mean Squared Error for our Training data")
gb_MSE_train = mean_squared_error(y_train, gradb_concrete_strength_predict)
print(gb_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
gb_MSE_val= mean_squared_error(y_val, gbr.predict(x_val))
print(gb_MSE_val)
R2 training
0.9550564818798402
[0.64258665 0.9677062  0.95589083 0.89315358 0.89401964 0.93853158
 0.8977251  0.87824418 0.93067015 0.8558253  0.84825835 0.77816698
 0.93204278 0.87379212 0.86324817 0.75731046 0.82959778 0.97935409
 0.9041985  0.86230854 0.92910242 0.89197541 0.72972661 0.89087065
 0.885467   0.65114558 0.83632215 0.79526811 0.93730008 0.82464094
 0.94004439 0.89528271 0.94673035 0.95306132 0.93770195 0.84652838
 0.94669139 0.88682519 0.83255764 0.95485962 0.79196778 0.93652517
 0.87530187 0.95482743 0.84632354 0.95019739 0.88493949 0.66260642
 0.94247646 0.88070785]
R^2: 0.861 (0.095)

Calculate mean squared error for training data
Mean Squared Error for our Training data
11.806556556489268
Mean Squared Error for our Validation data
28.560811771135846
In [503]:
print("Gradient Boost R^2 Value for Test Data")
gb_R2_test = gbr.score(x_test, y_test)
print(gb_R2_test)
print("Gradient Boost MSE Value for Test Data")
gb_MSE_test = mean_squared_error(y_test, gbr.predict(x_test))
print(gb_MSE_test)
Gradient Boost R^2 Value for Test Data
0.9080880111696099
Gradient Boost MSE Value for Test Data
27.06525155509415
In [504]:
Gradient_Boost ={'Metrics':['R2_train', 'R2_val_cross_val (+2 sd)', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Gradient_Boost':[gb_R2_train, gb_R2_val_cross_val, gb_R2_test, gb_MSE_train, gb_MSE_val, gb_MSE_test]}
Gradient_Boost_= pd.DataFrame(Gradient_Boost)
Gradient_Boost_
Out[504]:
Metrics Gradient_Boost
0 R2_train 0.955056
1 R2_val_cross_val (+2 sd) R^2: 0.874 (0.079)
2 R2_test 0.908088
3 MSE_train 11.8066
4 MSE_val 28.5608
5 MSE_test 27.0653
In [505]:
Metrics_Dataframe = pd.merge(Linear_Regression_,Decision_Tree_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Random_Forest_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,SVR_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,KNN_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Bagging_,how='outer',on='Metrics')
Metrics_Dataframe = pd.merge(Metrics_Dataframe,Gradient_Boost_,how='outer',on='Metrics')
Metrics_Dataframe
Out[505]:
Metrics Linear_Regression Decision_Tree Random_Forest SVR K Nearest_Neighbors Bagging Gradient_Boost
0 R2_train 0.825892 0.827275 0.931105 0.740118 0.884679 0.983771 0.955056
1 R2_val_cross_val (+2 sd) R^2: 0.782 (0.196) R^2: 0.696 (0.361) R^2: 0.834 (0.192) R^2: 0.706 (0.096) R^2: 0.794 (0.059) R^2: 0.880 (0.083) R^2: 0.874 (0.079)
2 R2_test 0.829232 0.796747 0.868448 0.758468 0.829675 0.906347 0.908088
3 MSE_train 45.7378 45.3744 18.0986 68.2704 30.2945 4.26339 11.8066
4 MSE_val 51.4099 55.5332 36.1036 91.2874 59.421 28.5489 28.5608
5 MSE_test 50.2859 59.8518 38.738 71.1238 50.1556 27.5779 27.0653

Ultimately, when looking at all these models. They all overfit. However, some are not that bad. For example, the linear regression model is very minimally overfit. Decision Tree, and Gradient Boost had little overfit too. SVR and K Nearest Neighbors are the most overfit. Overall, Gradient Boost seem to have the best results because overfitting wasn't much. A lot of variation from the concrete strength mean was explained by the independent variables, which is evidenced by the high R^2 values in all 3 parts of the data. The MSE is also relatively low compared to all the models, but not to the point where overfitting is a really huge issue.

Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)

GRID SEARCH CV to the Gradient Boost Model

In [507]:
#Performing GridSearch CV on the Gradient Boost Model Validation Set

from sklearn.model_selection import GridSearchCV
search_grid={'n_estimators':[100,200,300,500],'max_depth':[1,2,4,6,8],'random_state':[1]}
gsgradientb=GridSearchCV(estimator=gbr,param_grid=search_grid,scoring='r2',n_jobs=1,cv=10)

gsgradientb.fit(x_val, y_val)

#Compare the results to the training using this Gridsearch CV technique.
grids_gradientb_R2_train = gsgradientb.score(x_train, y_train)
print(grids_gradientb_R2_train)
0.8543969085881351
In [508]:
#Finding the best parameters
gsgradientb.best_params_
Out[508]:
{'max_depth': 2, 'n_estimators': 200, 'random_state': 1}
In [509]:
#This is R2 mean for the validation set iterations.

grids_gradientb_R2_val = gsgradientb.best_score_
grids_gradientb_R2_val
Out[509]:
0.8134864928556139
In [510]:
#Mean Squared Error for Training Set
print("Mean Squared Error for Training set")
grids_gradientb_MSE_train = mean_squared_error(y_train, gsgradientb.predict(x_train))
print(grids_gradientb_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
grids_gradientb_MSE_val= mean_squared_error(y_val, gsgradientb.predict(x_val))
print(grids_gradientb_MSE_val)
Mean Squared Error for Training set
38.249589828677756
Mean Squared Error for our Validation data
10.120574742780207
In [511]:
#R^2 for test set
print("R2 for test set.")
grids_gradientb_R2_test= gsgradientb.score(x_test, y_test)
print(grids_gradientb_R2_test)
#Mean Squared Error for Test Set
print("Mean Squared Error for Test data")
grids_gradientb_MSE_test = mean_squared_error(y_test, gsgradientb.predict(x_test))
print(grids_gradientb_MSE_test)
R2 for test set.
0.8704023209045224
Mean Squared Error for Test data
38.16252732979385

Despite GridSearch being a strong technique used for hyperparameter tuning, the result generated here for the gradient boost model is worse. Although GridSearch gives the best parameters for the model, it does not guarantee not underfitting or overfitting. The gradient boost model constructed before seems to overfit very little compared to other models. This gradient boost-grid search cv model is actually underfitting because the validation set MSE score is so much less than the the MSE for the training data. Both values should be ideally similar.

The MSE from the test set is more similar quantitatively to the mean squared error for the training set.

Gridsearch also has certain limitations in that we have to input the values for the parameters we would like to test. It could be possible that the best parameters are not even among the ones we put into the model. That is a limitation of this technique.

In [512]:
GridS_Gradient_Boost ={'Metrics':['R2_train', 'R2_val', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Grid_Search_Gradient_Boost':[grids_gradientb_R2_train, grids_gradientb_R2_val, grids_gradientb_R2_test, grids_gradientb_MSE_train, grids_gradientb_MSE_val, grids_gradientb_MSE_test]}
GridS_Gradient_Boost_= pd.DataFrame(GridS_Gradient_Boost)
GridS_Gradient_Boost_
Out[512]:
Metrics Grid_Search_Gradient_Boost
0 R2_train 0.854397
1 R2_val 0.813486
2 R2_test 0.870402
3 MSE_train 38.249590
4 MSE_val 10.120575
5 MSE_test 38.162527
In [ ]:
 

GRID SEARCH CV to the Random Forest Model

I originally thought that Random Forest would perform well because it is such a robust model. Standardization/Normalization is generally not necessary because the model is so robust. However, I was surprised how easy it was to have an overfit Random Forest Model. I decided to see if Grid Search could help improve the model by searching for the best parameters to change.

In [513]:
#Applying Grid Search CV to the Random Forest Model

param_grid_rf = {
    'bootstrap': [True],
    'max_depth': [1, 3, 7],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4],
    'min_samples_split': [8, 10],
    'n_estimators': [100, 150, 200]
}

grid_search_rf = GridSearchCV(estimator = rfr, param_grid = param_grid_rf, 
                          cv = 10, n_jobs = -1, verbose = 2)

grid_search_rf.fit(x_val, y_val)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 130 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done 333 tasks      | elapsed:    8.0s
[Parallel(n_jobs=-1)]: Done 616 tasks      | elapsed:   13.5s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   15.6s finished
Out[513]:
GridSearchCV(cv=10,
             estimator=RandomForestRegressor(max_depth=6, n_estimators=120,
                                             random_state=100),
             n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth': [1, 3, 7],
                         'max_features': [2, 3], 'min_samples_leaf': [3, 4],
                         'min_samples_split': [8, 10],
                         'n_estimators': [100, 150, 200]},
             verbose=2)
In [514]:
##Compare the results to the training using this Gridsearch CV technique.
grids_rf_R2_train = grid_search_rf.score(x_train, y_train)
print(grids_rf_R2_train)
0.7883854386324067
In [515]:
grid_search_rf.best_params_
Out[515]:
{'bootstrap': True,
 'max_depth': 7,
 'max_features': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'n_estimators': 200}
In [516]:
grids_rf_R2_val = gsadab.best_score_
grids_rf_R2_val
Out[516]:
0.7213673325354891
In [517]:
print("Mean Squared Error for our Training data")
grids_rf_MSE_train = mean_squared_error(y_train, grid_search_rf.predict(x_train))
print(grids_rf_MSE_train)
#Mean Squared Error for Validation Set
print("Mean Squared Error for our Validation data")
grids_rf_MSE_val= mean_squared_error(y_val, grid_search_rf.predict(x_val))
print(grids_rf_MSE_val)
Mean Squared Error for our Training data
55.590647805616705
Mean Squared Error for our Validation data
33.692074262532465

Again, the validation data MSE is lower than the training MSE, suggesting possible underfitting. Although GridSearch is capable of finding the best parameters given the data you feed it, it nonetheless cannot promise not to overfit or underfit.

In [518]:
#R-square for Test Set
print("R-square for Test Set")
grids_rf_R2_test = grid_search_rf.score(x_test, y_test)
print(grids_rf_R2_test)

#Mean Squared Error for Test Set
print("Mean Squared Error for our Validation data")
grids_rf_MSE_test= mean_squared_error(y_test, grid_search_rf.predict(x_test))
print(grids_rf_MSE_test)
R-square for Test Set
0.808184499527304
Mean Squared Error for our Validation data
56.4837605901446
In [519]:
GridS_RF ={'Metrics':['R2_train', 'R2_val', 'R2_test', 'MSE_train', 'MSE_val', 'MSE_test'], 'Grid_Search_Random Forest':[grids_rf_R2_train, grids_rf_R2_val, grids_rf_R2_test, grids_rf_MSE_train, grids_rf_MSE_val, grids_rf_MSE_test]}
GridS_RF_= pd.DataFrame(GridS_RF)
GridS_RF_
Out[519]:
Metrics Grid_Search_Random Forest
0 R2_train 0.788385
1 R2_val 0.721367
2 R2_test 0.808184
3 MSE_train 55.590648
4 MSE_val 33.692074
5 MSE_test 56.483761
In [495]:
GridS_Dataframe = pd.merge(GridS_Gradient_Boost_,GridS_RF_,how='outer',on='Metrics')
print(pd.DataFrame(GridS_Dataframe))
Gradient_Boost_
     Metrics  Grid_Search_Gradient_Boost  Grid_Search_Random Forest
0   R2_train                    0.862205                   0.777830
1     R2_val                    0.820255                   0.721367
2    R2_test                    0.876375                   0.798117
3  MSE_train                   36.198468                  58.363473
4    MSE_val                    9.676008                  35.197313
5   MSE_test                   36.403791                  59.448437
Out[495]:
Metrics Gradient_Boost
0 R2_train 0.953889
1 R2_val_cross_val (+2 sd) R^2: 0.861 (0.095)
2 R2_test 0.904304
3 MSE_train 12.1132
4 MSE_val 29.6077
5 MSE_test 28.1796

Although Grid Search can find the so-called best parameters of a model whose parameters are specified, the original Gradient Boost model seems to be the best to answer the question. Overfitting is only a little, high R^2 value for train, validate and test sets that meet the requirements of a 85-95% accuracy model (R^2). The MSE train, validate and test are all small numbers but not too small to be completely overfit. If the Gradient Boost model were to advance into production, the mean R2 value would be expected to fall 95% of the time within the confidence interval of (0.766-0.956). This is a good range for R^2 values because it assures that the dependent variable's averages can be explained by the independent variables studied. Additionally, minimal overfitting is achieved as well in this model. Low MSE scores suggest the model's ability to understand the data and is generalizable to unseen data.